Introduction

Los Angeles County provides openly available data on all restaurant and market inspections over the past 5 years. Facilities are subject to inspection 1 to 3 times a year, and made public within 1 week of inspection date. The frequency in which restaurants and food markets are inspected depends on the public health risk associated with the food products served or prepared and on the facility’s history of inspection grades. Inspectors deduct points based on violations and health risks, which is turned into a score out of 100. In addition, Los Angeles County data from 2018 on population health is publicly available. Demographic data, such as age and race distribution, socioeconomic data, such as proportion receiving EBT and proportion employed, and health outcomes data, such as proportion with asthma and rates of suicide, are provided for 87 cities within Los Angeles County.

Objective

I am interested in exploring restaurant inspection ratings in LA County. I have a few questions, with the main one being: Are restaurant inspection ratings associated with community health status? Secondary questions include: What are the “safest” and most “dangerous” cities in LA County for eating restaurant food? What restaurant chain is the “safest” to eat at? What restaurant chains should one proceed with caution?

Methods

Reading in and wrangling the data

I used 2 data sets which I merged together for this project. Both are available at data.lacounty.gov to download as CSV (I have also uploaded these datasets to my github repository). The first is a dataset of all LA County restaurant inspections. I added LA County city health data to see if there were any relationships between public health outcomes and local restaurant hygiene. These datasets were merged by city name, and no restaurant was missing its city in the first dataset. City names were briefly inspected to ensure matching would be feasible. As Los Angeles had many sub-cities for which there was health data, I only included the “City of Los Angeles” data to represent the local public health for any restaurant with city listed as Los Angeles. Only restaurants in cities with health data were included in this analysis.

if (!file.exists("LACinspections.csv.zip"))
  download(
    url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/LACinspections.csv.zip",
    dest = "LACinspections.csv.zip",
    mode="wb"
    )
unzip("LACinspections.csv.zip", exdir="./")
inspect <- read.csv("LACinspections.csv")

if (!file.exists("LAChealth.csv"))
  download(
    url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/main/LAChealth.csv",
    dest = "LAChealth.csv",
    mode="wb"
    )
health <- read.csv("LAChealth.csv")

health$GEONAME <- toupper(health$GEONAME)
health$GEONAME <- replace(health$GEONAME, health$GEONAME=="LOS ANGELES, CITY OF", "LOS ANGELES")

resthealth <- merge(
  x = inspect,
  y = health,
  by.x = "FACILITY_CITY",
  by.y = "GEONAME",
  all.x = FALSE,
  all.y = FALSE
)
resthealth <- data.table(resthealth)

Data Exploration

Data were explored utilizing R package ggplot2 to create a histogram of scores. Implausibly low scores were deleted (score less than 50, of which there was 1 value with score 3). Average scores within cities were computed and compared. Restaurant chains were identified through tokenizing words as bigrams, and looking to see most common chain restaurants. The following set of 9 chains were selected: McDonald’s, Jack in the Box, Starbucks, El Pollo Loco, Panda Express, Taco Bell, Del Taco, In N Out, Panera Bread. Average chain inspection scores were computed and compared. Measures of public health were selected: proportion with depression, proportion with obesity, proportion with diabetes. To explore a chain restaurant inspections scores with its surrounding city’s health, average score for a chain restaurant was calculated.

In the future, I hope to use OpenStreetMaps API to gather latitude and longitude data for each restaurant, so I can map restaurant score as a layer on top of a heat map of city proportion diabetes, obesity, or depression. For now, I was able to use geocode_zip to gather latitude and longitude for most zip codes in the dataset. The zip codes’ average inspection score, proportion of diabetes, obesity, and depression were calculated. Leaflet maps of the averages were created to compare spatially where clean/dirty restaurants are in relation to proportion of the population with Diabetes, obesity, or depression.

# Delete scores less than 50
library(data.table)
resthealth <- resthealth[SCORE>50]
# Find restaurant chains
resthealth %>% unnest_ngrams(word, FACILITY_NAME, n=2) %>% anti_join(stop_words, by = c("word")) # %>% count(word, sort=TRUE) %>% as_tibble() %>% print(n=100)
## Source: local data table [475,719 x 102]
## Call:   `_DT1`[!`_DT2`, on = .(word)]
## 
##   FACILITY_CITY ACTIVI…¹ OWNER…² OWNER…³ FACIL…⁴ RECOR…⁵ PROGR…⁶ PROGR…⁷ PROGR…⁸
##   <chr>         <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <int>
## 1 ALHAMBRA      2021/10… OW0269… SKATE … FA0280… PR0235… SKATES… ACTIVE     1634
## 2 ALHAMBRA      2018/05… OW0031… SANCHE… FA0006… PR0037… BUN N … ACTIVE     1638
## 3 ALHAMBRA      2018/05… OW0031… SANCHE… FA0006… PR0037… BUN N … ACTIVE     1638
## 4 ALHAMBRA      2021/07… OW0033… STARBU… FA0048… PR0005… STARBU… ACTIVE     1633
## 5 ALHAMBRA      2021/07… OW0033… STARBU… FA0048… PR0005… STARBU… ACTIVE     1633
## 6 ALHAMBRA      2019/05… OW0185… SAN TU… FA0179… PR0173… RICK'S… ACTIVE     1638
## # … with 475,713 more rows, 93 more variables: PE_DESCRIPTION <chr>,
## #   FACILITY_ADDRESS <chr>, FACILITY_STATE <chr>, FACILITY_ZIP <int>,
## #   SERVICE_CODE <int>, SERVICE_DESCRIPTION <chr>, SCORE <int>,
## #   SERIAL_NUMBER <chr>, EMPLOYEE_ID <chr>, ObjectId.x <int>, Pop_Tot <int>,
## #   Prop_18y <dbl>, Prop_64y <dbl>, Prop_65y_ <dbl>, Prop_Blk <dbl>,
## #   Prop_Lat <dbl>, Prop_Whi <dbl>, Prop_Asi <dbl>, Prop_Ami <dbl>,
## #   Prop_NHO <dbl>, Prop_FPL1 <dbl>, Prop_FPL2 <dbl>, Prop_forb <dbl>, …
## 
## # Use as.data.table()/as.data.frame()/as_tibble() to access results
# Create chain variable
resthealth$FACILITY_NAME <- toupper(resthealth$FACILITY_NAME)
resthealth$CHAIN <- ifelse(grepl("MCDONALD", resthealth$FACILITY_NAME), "McDonald's", ifelse(grepl("JACK IN THE BOX", resthealth$FACILITY_NAME), "Jack in the Box", ifelse(grepl("STARBUCKS", resthealth$FACILITY_NAME), "Starbucks", ifelse(grepl("EL POLLO LOCO", resthealth$FACILITY_NAME), "El Pollo Loco", ifelse(grepl("PANDA EXPRESS", resthealth$FACILITY_NAME), "Panda Express", ifelse(grepl("TACO BELL", resthealth$FACILITY_NAME), "Taco Bell", ifelse(grepl("DEL TACO", resthealth$FACILITY_NAME), "Del Taco", ifelse(grepl("OUT BURGER", resthealth$FACILITY_NAME), "In N Out", ifelse(grepl("PANERA BREAD", resthealth$FACILITY_NAME), "Panera Bread", NA)))))))))
# Find average inspection score by chain
chain_avg <- resthealth[ , .(
  scoreavg = mean(SCORE) 
), by = "CHAIN"]
# Clean health outcome data to be numeric
resthealth$Prop_obse <- as.numeric(as.character(resthealth$Prop_obse))
## Warning: NAs introduced by coercion
resthealth$Prop_DM <- as.numeric(as.character(resthealth$Prop_DM))
## Warning: NAs introduced by coercion
# Average score by restaurant
zip_avg <- resthealth[ , (ZipAvg = mean(SCORE, by = "FACILITY_ZIP")), list(FACILITY_NAME, FACILITY_ADDRESS, FACILITY_CITY, FACILITY_STATE, FACILITY_ZIP, Prop_DM, Prop_obse, Prop_depr) ]
zip_avg$FACILITY_ZIP <- as.character(zip_avg$FACILITY_ZIP)
# geocode to get lattitude and longitude by zip code
library(zipcodeR)
zip_cord <- geocode_zip(zip_avg$FACILITY_ZIP)
zip <- merge(
  x = zip_avg,
  y = zip_cord,
  by.x = "FACILITY_ZIP",
  by.y = "zipcode",
  all.x = TRUE,
  all.y= FALSE
)

# attempt to get restaurant lat/lng
# failed: geocode_osm , can't install SUNGEO
# failed: tidygeocoder,  takes hours
# myaddress.df <- resthealth[, query := paste0(FACILITY_ADDRESS, ", ", FACILITY_CITY, ", CA ", FACILITY_ZIP)]

# address <- geo(address = myaddress.df$query, method = "osm")

Preliminary Results

In total, there were 252773 inspections of 39423 restaurants in 66 cities within LA County.

Of the 252773 inspections included in the analysis, the average grade was 93.9351117 with a standard deviation of 3.8669866. The highest score was a perfect score, 100 whereas the lowest score was 54. Interestingly, there appears to be a peak at 90, which corresponds to the lowest possible score to achieve an A rating. This may hint at bias involved in the inspection grading process.

library(ggplot2)
scorehisto <- ggplot(resthealth, aes(x=SCORE)) +
  geom_bar(fill="red") +
  ggtitle("Distribution of LA County Restaurant Inspection Scores, 2017-2022") +
  xlab("Inspection Score")
scorehisto

Of the 66 cities included in this analysis, Long Beach had the highest average city inspection score, however there was only 2 inspections in that city, which is similar with East Los Angeles. The small sample size for those cities may bias the average score. The city with the worst average score was Monterey Park.

citysum <- resthealth %>% group_by(FACILITY_CITY) %>% summarise(mean_score = mean(SCORE), sd_fev = sd(SCORE), ninspect = n_distinct(RECORD_ID))


citysum %>% arrange(desc(mean_score)) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score", "Standard Deviation of Score", "Number of Inspections in City"), caption = "Top 5 Cities")
Top 5 Cities
City Average Score Standard Deviation of Score Number of Inspections in City
LONG BEACH 97.37500 0.7440238 2
CALABASAS 96.03195 2.9069571 127
SANTA CLARITA 95.74072 3.2719696 402
MAYWOOD 95.60426 3.4357203 156
EAST LOS ANGELES 95.49254 3.4039778 12
citysum %>% arrange(mean_score) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score", "Standard Deviation of Score", "Number of Inspections in City"), caption = "Bottom 5 Cities")
Bottom 5 Cities
City Average Score Standard Deviation of Score Number of Inspections in City
MONTEREY PARK 91.63380 5.506981 467
ROWLAND HEIGHTS 91.68030 4.965620 429
ALHAMBRA 92.36034 4.893888 477
GARDENA 92.36540 4.176667 685
CERRITOS 92.59487 4.416418 344

Of the chain restaurants examined, Panda Express had the best average score, whereas Del Taco had the lowest average score. Cities with Starbucks and Panera appeared to have the lowest proportion of diabetes. However, depression was highest in cities with Starbucks and Panda Express. Cities with Del Taco and Panera Bread had the highest amount of obesity.

resthealth[ , .(
  avgscore = mean(SCORE, na.rm = T),
  ninspect = n_distinct(RECORD_ID),
  avgDM = mean(Prop_DM, na.rm = T),
  avgMDD = mean(Prop_depr, na.rm = T),
  avgOB = mean(Prop_obse, na.rm = T)
), by = "CHAIN"] %>% na.omit() %>% as_tibble() %>% knitr::kable(col.names= c("Chain", "Average Score", "Number of inspections", "Average Proportion Diabetes", "Average Proportion Depression", "Average Proportion Obesity"), caption = "Average Score by Chain and Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant")
Average Score by Chain and Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant
Chain Average Score Number of inspections Average Proportion Diabetes Average Proportion Depression Average Proportion Obesity
Starbucks 96.62458 419 0.0984833 0.0875348 0.2299143
Taco Bell 96.52941 137 0.1034938 0.0831847 0.2472439
McDonald’s 95.15971 272 0.1036016 0.0826779 0.2438592
In N Out 96.89239 52 0.1005722 0.0750168 0.2412087
Panda Express 97.01446 109 0.1010105 0.0841394 0.2398609
Jack in the Box 94.79830 188 0.1053523 0.0794227 0.2500741
Panera Bread 94.62241 26 0.0976643 0.0812643 0.2606151
El Pollo Loco 94.92152 119 0.1034037 0.0799796 0.2445605
Del Taco 93.16085 76 0.1051764 0.0806202 0.2659273

Is there a correlation between restaurant inspection score and surrounding city health?

While most zip codes appear to have a decent average health inspection score (90+), there appear to be some areas where scores are lower. These include El Monte/Covina, Compton, and Central LA (West Adams). There even appears to be one zip code in Santa Monica with a poor average inspection score. In the following maps, I will explore spatial patterns in proportion of diabetes, obesity, and depression to see if it mirrors patterns found in the food inspection scores by zip code.

score.pal <- colorNumeric("inferno", domain=zip$V1)
scoremap <- leaflet(zip) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(V1), color = ~score.pal(V1),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=score.pal, values = zip$V1, title= "Inspection Score", opacity=1) %>% addTiles() %>%
  addControl("Average Inspection Scores by Zip Code", position = "bottomleft")
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
scoremap

The distribution of the population proportion of diabetes in LA County has a clearer spatial pattern than those of restaurant inspection scores. It appears that diabetes is lowest in communities bordering the ocean and Glendale. Diabetes is most prevalent in South/Southwest/Southeast Los Angeles. These are also some of the communities where restaurant scores are lower.

DM.pal <- colorNumeric("viridis", domain=zip$Prop_DM)
DMmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_DM), color = ~DM.pal(Prop_DM),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=DM.pal, values = zip$Prop_DM, title= "Proportion in City with Diabetes", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
DMmap

Obesity proportion by city follows patterns similar to that of Diabetes. It appears that the city of Inglewood has one of the highest proportion of obesity, and it also does have poorer food inspection grades. South LA appears to be one of the areas with worst restaurant inspection scores and highest in comorbidities.

ob.pal <- colorNumeric("plasma", domain=zip$Prop_obse)
obmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_obse), color = ~ob.pal(Prop_obse),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=ob.pal, values = zip$Prop_obse, title= "Proportion in City with Obesity", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
obmap

Interestingly, depression appears to take a different spatial pattern than diabetes and obesity. Here, depression is clustered around Santa Monica, with one zip code highly prevalent in depression near Inglewood. It does not seem that restaurant hygeine correlates strongly with depression proportions.

dep.pal <- colorNumeric("BuPu", domain=zip$Prop_depr)
depmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_depr), color = ~dep.pal(Prop_depr),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=dep.pal, values = zip$Prop_depr, title= "Proportion in City with Depression", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
depmap

Conclusion

There appears to be some correlation between poor health and poor restaurant inspection scores. The communities where this was apparent were clustered in South LA (ie Compton, Inglewood), and Eastern Los Angeles. Inglewood appears to be one of the hardest hit cities with regards to poor inspection scores, high proportion of diabetes and obesity, and high proportion of depression. The safest chain restaurant appears to be Panda Express, whereas one should be cautious of Del Taco. San Gabriel Valley cities, like Monterey Park, Alhambra, and Rowland Heights, were found to have the lowest average inspection scores in LA County.